A curation of awesome tools, documents and projects about LLM Security.
Contributions are always welcome. Please read the Contribution Guidelines before contributing.
- "Visual Adversarial Examples Jailbreak Large Language Models", 2023-06, AAAI(Oral) 24, `multi-modal`, [paper] [repo]
- "Are aligned neural networks adversarially aligned?", 2023-06, NeurIPS(Poster) 23, `multi-modal`, [paper]
- "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs", 2023-07, `multi-modal`, [paper]
- "Universal and Transferable Adversarial Attacks on Aligned Language Models", 2023-07, `transfer`, [paper] [repo] [page]
- "Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models", 2023-07, `multi-modal`, [paper]
- "Image Hijacking: Adversarial Images can Control Generative Models at Runtime", 2023-09, `multi-modal`, [paper] [repo] [site]
- "Weak-to-Strong Jailbreaking on Large Language Models", 2024-04, `token-prob`, [paper] [repo]
- "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", 2023-02, AISec@CCS 23 [paper]
- "Jailbroken: How Does LLM Safety Training Fail?", 2023-07, NeurIPS(Oral) 23, [paper]
- "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models", 2023-07, [paper] [repo]
- "Effective Prompt Extraction from Language Models", 2023-07, `prompt-extraction`, [paper]
- "Multi-step Jailbreaking Privacy Attacks on ChatGPT", 2023-04, EMNLP 23, `privacy`, [paper]
- "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?", 2023-07, [paper]
- "Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study", 2023-05, [paper]
- "Prompt Injection attack against LLM-integrated Applications", 2023-06, [paper] [repo]
- "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots", 2023-07, `time-side-channel`, [paper]
- "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher", 2023-08, ICLR 24, `cipher`, [paper] [repo]
- "Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities", 2023-08, [paper]
- "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs", 2023-08, [paper] [repo] [dataset]
- "Detecting Language Model Attacks with Perplexity", 2023-08, [paper]
- "Open Sesame! Universal Black Box Jailbreaking of Large Language Models", 2023-09, `gene-algorithm`, [paper]
- "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", 2023-10, ICLR(oral) 24, [paper] [repo] [site] [dataset]
- "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models", 2023-10, ICLR(poster) 24, `gene-algorithm`, `new-criterion`, [paper]
- "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations", 2023-10, CoRR 23, `ICL`, [paper]
- "Multilingual Jailbreak Challenges in Large Language Models", 2023-10, ICLR(poster) 24, [paper] [repo]
- "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", 2023-11, SoLaR(poster) 24, [paper]
- "DeepInception: Hypnotize Large Language Model to Be Jailbreaker", 2023-11, [paper] [repo] [site]
- "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily", 2023-11, NAACL 24, [paper] [repo]
- "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models", 2023-10, [paper]
- "Language Model Inversion", 2023-11, ICLR(poster) 24, [paper] [repo]
- "An LLM can Fool Itself: A Prompt-Based Adversarial Attack", 2023-10, ICLR(poster) 24, [paper] [repo]
- "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts", 2023-09, [paper] [repo] [site]
- "Many-shot Jailbreaking", 2024-04, [paper]
- "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [paper] [repo]
- "BITE: Textual Backdoor Attacks with Iterative Trigger Injection", 2022-05, ACL 23, `defense`, [paper]
- "Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models", 2023-05, EMNLP 23, [paper]
- "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection", 2023-07, NAACL 24, [paper] [repo] [site]
- "Baseline Defenses for Adversarial Attacks Against Aligned Language Models", 2023-09, [paper] [repo]
- "LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked", 2023-08, ICLR 24 Tiny Paper, `self-filtered`, [paper] [repo] [site]
- "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM", 2023-09, `random-mask-filter`, [paper]
- "Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models", 2023-12, [paper] [repo]
- "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks", 2024-03, [paper] [repo]
- "Protecting Your LLMs with Information Bottleneck", 2024-04, [paper] [repo]
- "PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition", 2024-05, ICML 24, [paper] [repo]
- "Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs", 2024-06, [paper]
- "LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins", 2023-09, [paper] [repo]
- "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks", 2023-10, ACL 24, [paper]
- "Security and Privacy Challenges of Large Language Models: A Survey", 2024-02, [paper]
- "Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models", 2024-03, [paper]
- "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", 2024-07, [paper]
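Several of the defense papers above (e.g. "Detecting Language Model Attacks with Perplexity") rest on a simple observation: optimized adversarial suffixes read as gibberish, so they score far higher perplexity than natural text under any reasonable language model. A minimal, model-agnostic sketch of that filter — the function names, the threshold, and the per-token log-probability inputs are all illustrative assumptions, not any paper's actual implementation:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def looks_adversarial(token_logprobs, threshold=1000.0):
    """Flag a prompt whose perplexity exceeds a tuned threshold.

    In practice the log-probs would come from scoring the prompt with a
    small reference LM; the threshold is calibrated on benign traffic.
    """
    return perplexity(token_logprobs) > threshold

# Natural text: high-probability tokens, low perplexity.
natural = [-1.2, -0.8, -2.0, -1.5]
# GCG-style suffix: near-uniform tokens over a large vocab, huge perplexity.
gibberish = [-10.8, -10.8, -10.8, -10.8]
```

As the perplexity papers note, this blocks unreadable suffixes well but does nothing against fluent jailbreaks (persona modulation, nested prompts), so it is a complement to alignment, not a replacement.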
- Plexiglass: a security toolbox for testing and safeguarding LLMs
- PurpleLlama: a set of tools to assess and improve LLM security
- Rebuff: a self-hardening prompt injection detector
- Garak: an LLM vulnerability scanner
- LLMFuzzer: a fuzzing framework for LLMs
- LLM Guard: a security toolkit for LLM Interactions
- Vigil: an LLM prompt injection detection toolkit
- jailbreak-evaluation: an easy-to-use Python package for language model jailbreak evaluation
- Prompt Fuzzer: an open-source tool for hardening GenAI applications
- WhistleBlower: an open-source tool that infers the system prompt of an AI agent from its generated text outputs
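Detectors like Rebuff and extraction tools like WhistleBlower revolve around prompt leakage. One common defense is a canary token: embed a random marker in the system prompt and alarm if it ever surfaces in output. A toy sketch of the idea — the names and the marker format are hypothetical, not any tool's real API:

```python
import secrets

CANARY_PREFIX = "cnry-"

def add_canary(system_prompt):
    """Embed a random canary word in the system prompt; if the canary
    ever appears in model output, the prompt has been extracted."""
    canary = CANARY_PREFIX + secrets.token_hex(8)
    guarded = f"{system_prompt}\n(internal marker: {canary})"
    return guarded, canary

def leaked(model_output, canary):
    """True if the model's response reveals the canary verbatim."""
    return canary in model_output

guarded, canary = add_canary("You are a helpful assistant.")
```

A verbatim substring check catches naive leaks only; an attacker can ask for the prompt base64-encoded or paraphrased, which is why the tools above layer heuristics and model-based checks on top.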
- Hacking Auto-GPT and escaping its docker container
- Prompt Injection Cheat Sheet: How To Manipulate AI Language Models
- Indirect Prompt Injection Threats
- Prompt injection: What’s the worst that can happen?
- OWASP Top 10 for Large Language Model Applications
- PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news
- ChatGPT Plugins: Data Exfiltration via Images & Cross Plugin Request Forgery
- Jailbreaking GPT-4's code interpreter
- Securing LLM Systems Against Prompt Injection
- The AI Attack Surface Map v1.0
- Adversarial Attacks on LLMs
- How Anyone can Hack ChatGPT - GPT4o
- LLM Evaluation metrics, framework, and checklist
- How RAG Poisoning Made Llama3 Racist!
- [0din GenAI Bug Bounty from Mozilla](https://0din.ai): the 0Day Investigative Network is a bug bounty program focusing on flaws within GenAI models. Vulnerability classes include Prompt Injection, Training Data Poisoning, DoS, and more.
- Gandalf: a prompt injection wargame
- LangChain vulnerable to code injection - CVE-2023-29374
- LLM Security startups
- Adversarial Prompting
- Epivolis: a prompt injection aware chatbot designed to mitigate adversarial efforts
- LLM Security Problems at DEFCON31 Quals: the world's top security competition
- PromptBounty.io
- PALLMs (Payloads for Attacking Large Language Models)
- Twitter: @llm_sec
- Blog: LLM Security authored by @llm_sec
- Blog: Embrace The Red
- Blog: Kai's Blog
- Newsletter: AI safety takes
- Newsletter & Blog: Hackstery